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Abstract 

We propose a new algorithm for minimizing reg¬ 
ularized empirical loss: Stochastic Dual Newton 
Ascent (SDNA). Our method is dual in nature: in 
each iteration we update a random subset of the 
dual variables. However, unlike existing methods 
such as stochastic dual coordinate ascent, SDNA 
is capable of utilizing all curvature information 
contained in the examples, which leads to strik¬ 
ing improvements in both theory and practice - 
sometimes by orders of magnitude. In the spe¬ 
cial case when an L2-regularizer is used in the 
primal, the dual problem is a concave quadratic 
maximization problem plus a separable term. In 
this regime, SDNA in each step solves a prox¬ 
imal subproblem involving a random principal 
submatrix of the Hessian of the quadratic func¬ 
tion; whence the name of the method. If, in addi¬ 
tion, the loss functions are quadratic, our method 
can be interpreted as a novel variant of the re¬ 
cently introduced Iterative Hessian Sketch. 

1. Introduction 

Empirical risk minimization (ERM) is a fundamental 
paradigm in the theory and practice of statistical infer¬ 
ence and machine learning (Shalev-Shwartz & Ben-David, 
2014). In the “big data” era it is increasingly common in 
practice to solve ERM problems with a massive number of 
examples, which leads to new algorithmic challenges. 

State-of-the-art optimization methods for ERM include i) 
stochastic (sub)gradient descent (Shalev-Shwartz et ah, 


2011; Takac et ah, 2013), ii) methods based on stochastic 
estimates of the gradient with diminishing variance such 
as SAG (Schmidt et ah, 2013), SVRG (Johnson & Zhang, 
2013), S2GD (Konecny & Richtarik, 2014), proxSVRG 
(Xiao & Zhang, 2014), MISO (Mairal, 2014), SAGA 
(Defazio et ah, 2014), minibatch S2GD (Konecny et ah, 
2014a), S2CD (Konecny et ah, 2014b), and iii) variants 
of stochastic dual coordinate ascent (Shalev-Shwartz & 
Zhang, 2013d; Zhao & Zhang, 2014; Takac et ah, 2013; 
Shalev-Shwartz & Zhang, 2013b;a; Lin et ah, 2014; Qu 
et ah, 2014; Shalev-Shwartz & Zhang, 2013c). 

There have been several attempts at designing methods that 
combine randomization with the use of curvature (second- 
order) information. For example, methods based on run¬ 
ning coordinate ascent in the dual such as those mentioned 
above and also (Richtarik & Takac, 2014; 2012; Fercoq 
& Richtarik, 2013b; Tappenden et ah, 2014; Richtarik & 
Takac, 2013a;b; Fercoq & Richtarik, 2013a; Fercoq et ah, 
2014; Qu et ah, 2014; Qu & Richtarik, 2014a) use cur¬ 
vature information contained in the diagonal of a bound 
on the Hessian matrix. Block coordinate descent methods, 
when equipped with suitable data-dependent norms for the 
blocks, use information contained in the block diagonal of 
the Hessian (Tappenden et ah, 2013). 

A more direct route to incorporating curvature information 
was taken by Schraudolph et ah (2007) in their stochastic 
L-BFGS method, by Byrd et ah (2014) and Sohl-Dickstein 
et ah (2014) in their stochastic quasi-Newton methods 
and by Fountoulakis & Tappenden (2014) who proposed 
a stochastic block coordinate descent methods. While typ¬ 
ically efficient in practice, none of the methods mentioned 
above are equipped with complexity bounds (bounds on the 
number of iterations). An exception in this regard is the 
work of Bordes et ah (2009), who give a ()(1 /e) complex- 
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ity bound for a Quasi-Newton SGD method. 

1.1. Contributions 

The main contribution of this paper is the design and anal¬ 
ysis of a new algorithm— stochastic dual Newton ascent 
(SDNA) —for solving a regularized ERM problem with 
smooth loss functions and a strongly convex regularizer 
(primal problem). Our method is stochastic in nature and 
has the capacity to utilize all curvature information inher¬ 
ent in the data. While we do our analysis for an arbitrary 
strongly convex regularizer, for the purposes of the intro¬ 
duction we shall describe the method in the case of the L2 
regularizer. In this case, the dual problem is a concave 
quadratic maximization problem with a strongly concave 
separable penalty. 

SDNA in each iteration picks a random subset of the 
dual variables (which corresponds to picking a mini¬ 
batch of examples in the primal problem), following 
an arbitrary probability law, and maximizes, exactly, 
the dual objective restricted to the random subspace 
spanned by the coordinates. Equivalently, this can be 
seen as the solution of a proximal subproblem involving a 
random principal submatrix of the Hessian of the quadratic 
function. Hence, SDNA utilizes all curvature informa¬ 
tion available in the random subspace in which it oper¬ 
ates. Note that this is very different from the update strat¬ 
egy of parallel / minibatch coordinate descent methods. In¬ 
deed, while these methods also update a random subset of 
variables in each iteration, they instead only utilize curva¬ 
ture information present in the diagonal of the Hessian. 

As we will explain in detail in the text, SDCA-like meth¬ 
ods need more iterations (and hence more passes through 
data) to convergence as the minibatch size increases. How¬ 
ever, SDNA enjoys the opposite behavior: with increas¬ 
ing minibatch size, SDNA needs fewer iterations (and 
hence fewer passes over data) to convergence. This ob¬ 
servation can be deduced from the complexity results we 
prove for SDNA, and is also confirmed by our numerical 
experiments. In particular, we show that the expected du¬ 
ality gap decreases at a geometric rate which i) is better 
than that of SDCA-like methods such as SDCA (Shalev- 
Shwartz & Zhang, 2013d) and QUARTZ (Qu et ah, 2014), 
and ii) improves with increasing minibatch size. This im¬ 
provement does not come for free: as we increase the mini¬ 
batch size, the subproblems grow in size as they involve 
larger portions of the Hessian. We find through experi¬ 
ments that for some, especially dense problems, even rel¬ 
atively small minibatch sizes lead to dramatic speedups 
in actual runtime. 

We show that in the case of quadratic loss, and when 
viewed as a primal method, SDNA can be interpreted 
as a variant of the recently introduced Iterative Hessian 


Sketch algorithm (Pilanci & Wainwright, 2014). 

En route to developing SDNA which we describe in Sec¬ 
tion 4, we also develop several other new algorithms: 
two in Section 2 (where we focus on smooth problems), 
one in Section 3 (where we focus on composite problems). 
Besides SDNA, we also develop and analyze a novel mini¬ 
batch variant of SDCA in Section 4, for the sake of find¬ 
ing suitable method to compare SDNA to. SDNA is equiv¬ 
alent to applying the new method developed in Section 3 to 
the dual of the ERM problem. However, as we are mainly 
interested in solving the ERM (primal) problem, we addi¬ 
tionally prove that the expected duality gap decreases at a 
geometric rate. Our technique for doing this is a variant 
of the one use by Shalev-Shwartz & Zhang (2013d), but 
generalized to an arbitrary sampling. 

1.2. Notation 

Vectors. By ei,..., e n we denote the standard basis vec¬ 
tors in R n . For any x £ R n , we denote by Xi the zth el¬ 
ement of x, i.e., Xi = ejx. For any two vectors x 1 y of 
equal size, we write (x, y) = x T y = JR and by x o y 
we denote their Hadamard (i.e., elementwise) product. We 
also write it -1 = (uj" 1 ,..., it -1 ). 

Matrices. I is the identity matrix in R" x and D ( w ) is 
the diagonal matrix in R nxra with w £ R” on its diagonal. 
We will write MAO (resp. M >- 0) to indicate that M is 
positive semidefinite (resp. positive definite). 

Subsets of coordinates. Let S' be a nonempty subset of 
[n] := {1,2,..., n}. For any matrix M £ R nxn we write 
Mg for the matrix obtained from M by retaining elements 
M, , for which both i £ S and j £ S and zeroing out all 
other elements. Clearly, Mg = IgMIg. Moreover, for 


any vector h £ R n we write 

hs ■= Ish = Ya=i hid. (1) 

Note that we can thus write 

(h s ) T Mh s = h T I s MI s h = h T M s h, (2) 
and that for x,y £ R n we have 

(xs,y) = (I sx,y) = (x,I sy) = (x,y s ). (3) 

By (Mg) -1 we denote the matrix in R nxn for which 

(Ms) -1 Mg = Mg(Mg) -1 = Ig. (4) 

2. Minimization of a Smooth Function 


In this section we consider unconstrained minimization of 
a differentiable convex function: 

min fix). (5) 

zeM" J K ' 
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In particular, we shall assume smoothness (Lipschitz con¬ 
tinuity of the gradient) and strong convexity of /: 

Assumption 1 (Smoothness). There is a positive definite 
matrix M £ R nx " such that for all x,h £ R", 

fix + h)< f{x) + (Vf(x),h) + \ (M h, h) (6) 

Assumption 2 (Strong convexity). There is a positive def¬ 
inite matrix G £ R" xn such that for all x, h £ K", 

f{x) + <V/(s), h) + l -{Gh, h) < f(x + h). (7) 

2.1. Three stochastic algorithms 

We now describe three algorithmic strategies for solving 
problem (5), the first two of which are new. All these 
methods have the form 

x k+1 ^x k + h k , (8) 

where h k is only allowed to be nonzero for i £ Sk, where 
{Sk}k>o are i.i.d. random subsets of [n] := {1, 2,..., n} 
(“samplings”). That is, all methods in each iteration up¬ 
date a random subset of the variables. The four methods 
will only differ in how the update elements h k for i £ Sk 
are computed. If we wish the methods to work, we neces¬ 
sarily need to require that every coordinate has a positive 
probability of being sampled. For certain technical reasons 
that will be apparent later, we will also assume that Sk is 
nonempty with probability 1. 

Assumption 3 (Samplings). The random sets {Sk } k>o are 

1.1. d., proper (i.e., Prob(* £ Sk) > 0 for all i £ [n\) and 
nonvacuous (i.e., Prob(«Sfc = 0) = 0). 

Much of our discussion will depend on the distribution of 
Sk rather than on k. As {<S'fe}fc>o are i.i.d., we will write 
S for a sampling which shares their distribution. We will 
write p = ipi,... ,p n ) where 

Pi := Prob(i £ S), i £ [n]. (9) 

By Assumption 3, we have pt > 0 for all i. We now de¬ 
scribe the methods. 

Method 1. We compute (MjJ -1 and set 

h k = -(M Sh )- 1 Vf(x k ). (Method 1) 

Note that the update only involves the inversion of a ran¬ 
dom principal submatrix of M of size \Sk\ x \Sk\. Also, 
we only need to compute elements i £ Sk of the gradi¬ 
ent Vfixk). If |Sk is reasonably small, the update step is 
cheap. 

Method 2. We compute the inverse of E[M S ] and set 

h k = -IsJEpVls])"^ ip)S7fix k ). (Method 2) 


This strategy easily implementable when |= 1 with 
probability 1 (i.e., if we update a single variable only). This 
is because then E[M S ,] is a diagonal matrix with the {i, i) 
element equal to Hence, the update step simplifies 

to h k = - (e^ V/(x fc )) for i £ Sk and h k = 0 for 

* ^ Sk- For more complicated samplings S, however, the 
matrix E[M^] will be as hard to invert as M. 

Method 3. We compute a vector v £ M" for which 

E[Mj] A D(p)D(v) (10) 

and then set 

h k = -I Sk (D(u)) -1 V/(a; fc ). (Method 3) 

Assuming v is easily computable (this should be done be¬ 
fore the methods starts), the update is clearly very easy to 
perform. Indeed, the update can be equivalently written as 
h k = V/( x k )) for i £ S k and h\ = 0 for i £ S k . 

Method 3 is known as NSync (Richtarik & Takac, 2013b). 
For a calculus allowing the computation of closed form for¬ 
mulas for v as a function of the sampling S we refer the 
reader to (Qu & Richtarik, 2014b). 

Note that all three methods coincide if | .S' = 1 with prob¬ 
ability 1. 

2.2. Three linear convergence rates 

We shall now show that, putting the issue of the cost of each 
iteration of the three methods aside, all enjoy a linear rate 
of convergence. In particular, we shall show that Method 
1 has the fastest rate, followed by Method 2 and finally. 
Method 3. 

Theorem 1. Let Assumptions 1, 2 and 3 be satisfied. Let 
{a; fc }fc>o be the sequence of random vectors produced by 
Method m, for m = 1, 2, 3 and let x* be the optimal solu¬ 
tion of (5). Then 

nfix k+1 ) - fix*)} < (1 - a m ) E[fix k ) - fix*)}, 

where 

cri :=A min (G^E^Mj) -1 
<r 2 := A min (G 1 / 2 D(p) (E [M^-'D^G 1 / 2 ) , (12) 
<r 3 := A min (G^DfoptOG 1 / 2 ) • (13) 

The above result means that the number of iterations suffi¬ 
cient for Method m to obtain an e-solution (in expectation) 

is °(T!7 lo g( 1 / e ))- 

In the above theorem (which we prove in Section 2.4), 
A mi n (X) refers to the smallest eigenvalue of matrix X. It 
turns out that in all three cases, the matrix X involved is 
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positive definite. However, for the matrices in (11) and (12) 
this will only be apparent if we show that E[Mg] >- 0 and 
E[(Mj) _1 ] >- 0, which we shall do next. 

Lemma 1. If S is a proper sampling, then E [Mg] >- 0. 


2.3. Example 

Consider the function / : M 3 —> R given by 

/1.0000 0.9900 0.9999\ 

f(x) = \x T M.X , M = 0.9900 1.0000 0.9900 

\0.9999 0.9900 1.0000/ 


Proof. Denote supp{x} = {i £ [ n ] : x t 0}. Since M >- 
0, any principal submatrix of M is also positive definite. 
Hence for any x £ R n \{0}, i T Msi = 0 implies that 
suppjx} D S = 0 for all SC [n]. If x £ R n is such that 


Note that Assumption 1 holds, and Assumption 2 holds 
with G = M. Let S be the “ 2 -nice sampling” on [n] = 
{1,2,3}. That is, we set Prob(S f = {*,/}) = |. for 
(i,j) = (1, 2), (2, 3), (3,1). A straightforward calculation 
reveals that: 


x T E [Mj] x = Esc[n] Prob(5 = S)x T M s x = 0 , 


then Prob(supp{x} D S = 0) = 1. Since S is proper, this 
only happens when x = 0. Therefore, E[M S ] >- 0. □ 


E 



/ 1683.50 -16.58 -1666.58\ 

( -16.58 33.50 -16.58 I , 

\ —1666.58 -16.58 1683.50 ) 


1 / 0.9967 -0.3268 -0.3365\ 

Dip) (E [MaI ) Dip) « -0.3268 0.9902 -0.3268 

\ —0.3365 -0.3268 0.9967 J 


Lemma 2. If S is proper and nonvacuous, then 


0 -< D(p) (E [Mj]) 1 D(p)^E[(M^) 1 


It can be verified that (10) holds with v = (2,2,2); see 
(Richtarik & Takac, 2012) or (Qu & Richtarik, 2014b). 
(14) Therefore, D(p)D(r> -1 ) = |l. Finally, we obtain: 


Proof. The first inequality follows from Lemma 1 and the 
fact for proper S we have p > 0 and hence D(p) >- 0. 
We now turn to the second inequality. Fix h £ R n . For 
arbitrary 0 S C [n] and y £ R n we have: 

\h T ( M s )- 1 h=\h r s (Ms) -1 hs 
= max(i v,h s ) - ±x T M s x > ( y,h s ) - \y T Msy. 


a i w 0.3350, a 2 « 1.333 • 10 -4 , cr 2 « 0.333 • 10 -4 . 

Note that: theoretical rate, <ji, of Method 1 is 10,000 
times better than the rate, < 73 , of parallel coordinate de¬ 
scent (Method 3). 

2.4. Proof of Theorem 1 

Proof. By minimizing both sides of (7) in h, we get: 


Substituting S = S and taking expectations, we obtain 


4E 


h r (M § ) 1 h 


> E [(y,h § ) - ±y T Mgy] 


= y T D(p)h-y T E[Mg]y. 


Therefore, \ h T E 
4t/ T E[M 5 ]y = 


1 

2 


(Ms ) _1 

h T D(p) 


h > max y6 Rr> y T D(p)h 
(E [Mg,]) -1 D[p)h. 


O 


We now establish an important relationship between the 
quantities <j\, a 2 and <r 3 , which sheds light on the conver¬ 
gence rates of the three methods. 

Theorem 2. 0 < cr 3 < 02 < o\ < 1. 


Proof. We have a m > 0 for all m since <j m is the 
smallest eigenvalue of a positive definite matrix. That 
cr m < 1 follows as a direct corollary Theorem 1. Fi- 

( 10 ) 

nally, D(p)D(t" 1 ) = D(p)D(p- 1 )D(x" 1 )D(p) A 


D(p)(E[Mj]) 1 D(p) ( a’ E 


(Me 


, -1 


□ 


f(x) - f(x*) < ^(v/(x), G -1 V/(x)). (15) 

In view of (6) and (2), for for all h £ R" we have: 

f(x k +I Sk h) < f(x k ) + (V/(x fe ), I Sk h) + l(M Sk h, h). 

(16) 

Method 1: If we use (16) with h ■£- h k := 
-(MsJ -1 V/(x fe ), and apply (4), we get: 

f(x k+1 ) < f(x k ) - i(V/(x fc ), (M s J -1 V/(x fe )). 

Taking expectations on both sides with respect to Sk yields: 

E k[f(x k+1 )) 

< f(x k ) - i(V/(x fe ),E[(Mg) -1 ]V/(x fe )) 

( < f(x k )- a ^(S7f(x k ),G- 1 S7f(x k )) 

( < f(x k ) - (7! ( f(x k ) - f(x*)) , 

where E* denotes the expectation with respect to Sp : ■ It 
remains to rearrange the inequality and take expectation. 
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Method 2: Let D = Dip). Taking expectations on 
both sides of (16) with respect to Sk, we see that for 
all h £ ] R" the following holds: Ek[f(x k + Is fc /i)] < 
f(x k ) + (DV f(x k ), h) + h). Note that the 

choice h k := — (E[M^]) _1 DV/(x fe ) minimizes the right 
hand side of the inequality in h. Since h k = I s k h k , 

E k[f(x k+1 )} 

< /(o: fe )-i(V/(x fe ),D(E[M5])- 1 DV/(a: fe )) 

'< f(x k ) - ^(V/(x fe ), G _1 V/(a; fe )) 

(15) 

< f(x k ) - 02 ( f(x k ) - f(x*)) . 

Method 3: The proof is the same as that for Method 

2, except in the first inequality we replace E[MsJ by 

D(p)D(u) (see (10)). □ 

3. Minimization of a Composite Function 

In this section we consider the following composite mini¬ 
mization problem: 


min F(x) = f(x) + V] ^i{xi). (17) 

xeR n 

i —1 

We assume that / satisfies Assumptions 6 and 7. The dif¬ 
ference from the setup in the previous section is in the in¬ 
clusion of the separable term '‘Pi- 

Assumption 4. For each i, i/p : M -> RU {+oo} ‘ s 
closed and ji-strongly convex for some 7 j > 0. Let 
7= ( 71 . • --)7n) eK+. 

For ease of presentation, in this section we only consider 
uniform sampling S, which means that Prob(i £ S) = 
Problj £ S) for all i,j £ [n\. In particular, this implies 
that Prob(i £ S) = for all i. Let r := E[5 1 ]. 


3.1. New algorithm 

We now propose Algorithm 1, which a variant of Method 
1 applicable to problem (17). If 'ipi = 0 for all i, the meth¬ 
ods coincide. The following result states that the method 
converges at a geometric rate, in expectation. 

Theorem 3. Let Assumptions 1, 2, 3 and 4 be satisfied. 
Then the output sequence {x k }k>o of Algorithm 1 satisfies: 

E[F(x k+1 ) - F(z*)] < (1 - a prox )E[F(x k ) - F(cc*)], 

where x* is the solution of (17), dj , '“ := Tmin r [ 1,Sl ^ and 


Si :— A„ 


(™E[Mj]+D( 7 )) 1 (D( 7 ) + G) 


Algorithm 1 Proximal version of Method 1 

1: Parameters: uniform sampling S 
2: Initialization: choose initial point x° £ K" 

3: for k = 0, 1, 2, ... do 
4: Generate a random set of blocks Sk ~ S 

5: Compute: h k = argmin heR „(V/(a: fc ), h Sk ) + 

M Sk h) + £ ieSfc f>i(x k + hf, 

6: Update: x k+1 := x k + hg k 

7: end for 


Note for positive definite matrices X, Y, we have 
A min (X^ 1 Y) = A m i n (Y 1 / 2 X _1 Y 1 / 2 ). It is this latter 
form we have used in the formulation of Theorem 1 . In the 
special case when 7 = 0 (tpi are merely convex), we have 
a p r x = min{^, pA min (G 1/2 (E[Mj])- 1 G 1 / 2 )}. Note 
that while this rate applies to a proximal/composite vari¬ 
ant of Method 1, its rate is best compared to the rate 02 of 
Method 2. Indeed, looking at (12), and realizing that for 
uniform S we have D(p) = -I, we get 

<?i > <72 = ^A min (G 1 / 2 (E[M^])- 1 G 1 / 2 ) > a prox . 

So, the rate we can prove for the composite version of 
Method 1 (a prox ) is weaker than the rate we get for Method 
2 ( 02 ), which by Theorem 2 is weaker than the rate of 
Method 1 ((j 1 ). We believe this is a byproduct of our anal¬ 
ysis rather than the weakness of Algorithm 1. 

3.2. PCDM 

We will now compare our new Algorithm 1 with the Par¬ 
allel Coordinate Descent Method (PCDM) of Richtarik & 
Takac (2012), which can also be applied to problem (17). 


Algorithm 2 PCDM (Richtarik & Takac, 2012) 

1: Parameters: uniform sampling S;v £ + 

2: Initialization: choose initial point x° £ K" 

3: for k = 0,1, 2,... do 

4: Generate a random set of blocks Sk ~ S 

5: Compute for i £ Sk 

h k = engmmeJ\/f(x k )h i + Ph\h i \ 2 + fi>i(x k + h i ) 

hie R 2 

6: Update: x k+1 := x k + YlieS k ^ i e i 

7: end for 


Proposition 1. Let the same assumptions as those in The¬ 
orem 3 be satisfied. Moreover, assume v £ M" , is a vector 
satisfying (10). Then the output sequence {x fc }fc>o of Al¬ 
gorithm 2 satisfies 

E[F(x k+1 ) - F(a;*)] < (1 - o prox )E[F{x k ) - F(x*)], 

/ prox rmin(l,S3) » 

where := - v y ana 

o n 


s 3 := A 


min 


(D(w + 7 )) -1 (D( 7 ) + G) 
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Proof. Sketch: The proof is a minor modification of the 
arguments in (Richtarik & Takac, 2012). □ 

3.3. Comparison of the rates of Algorithms 1 and 2 

We now show that the rate of linear (geometric) conver¬ 
gence of our method is better than that of PCDM. 

Proposition 2. af ox > a p 3 rox . 

Proof. Since p, = - for all i, we have D(p) = -I and 
hence from ( 10 ) we deduce that: 

77 (10) 

- E\M S ] + D( 7 ) P D(u) + D( 7 ) = D(v + 7 ), 

whence si > S 3 , and the claim follows. □ 


4. Empirical Risk Minimization 

We now turn our attention to the empirical risk minimiza¬ 
tion problem: 


mm P(w) := + \g(w). (18) 

i—1 

We assume that g : —> K. is a 1-strongly convex func¬ 
tion with respect to the L2 norm and each loss function 
(f)i : R —> R is convex and l/ 7 -smooth. Each a,i is a d- 
dimensional vector and for ease of presentation we write 
A = (ai,..., a n ) = E"=i a i e J ■ Let 9* and {>*}* be 
the Fenchel conjugate functions of g and respec¬ 

tively. In the case of g , for instance, we have g*(s) = 
sup^ggd (w, s)—g(w). The (Fenchel) dual problem of (18) 
can be written as: 

n 

'■= - Xg* . (19) 

ckEK 71 ^— 

i—1 

4.1. SDNA: A new algorithm for ERM 

Note that the dual problem has the form (17) 

n 

min F(a) = f(a) + ^^( 07 ), (20) 

a£K n z ' 

i =1 

where F(a) = —D(a), /(a) = Xg*(j^Aa) and^(ai) = 
It is easy to see that / satisfies Assumption 1 
with M := -X, where X := A A. Moreover, fi, is 
^-strongly convex. We can therefore apply Algorithm 1 to 
solve the dual (20). This is what Algorithm 3 does. 

If a* is the optimal solution of (19), then the optimal solu¬ 
tion of (18) is given by: 

w *= V 5*(^ A «*)- (2!) 


Algorithm 3 Stochastic Dual Newton Ascent (SDNA) 

1: Parameters: proper nonvacuous sampling S 
2: Initialization: a 0 £ K”; a 0 = ^Aa° 

3: for k = 0,1,2, ... do 
4: Primal update: w k = S7g*(a k ) 

5: Generate a random set of blocks Sk ~ S 

Ac^ = argmin((A T w k )s k ,h) + \h T X Sk h 

h£WL n 

+ J2iGS k - hi) 

7: Dual update: a k+1 := a k + (A a k )s k 

8: Average update: a k+1 = ct k + ^ J2ies k ^ a i a i 

9: end for 


With each proper sampling S we associate the number: 


8(S) := min 

i 


PiXp/n 

Vi + X 7 n ’ 


( 22 ) 


where (pi, ■ ■ ■ ,p n ) is the vector of probabilities defined 
in (9) and v = (v\,, v n ) £ R™, is a vector satisfying: 


E[(A T A)g] ± D(p)D(u). (23) 


Closed-form expressions for v satisfying this inequal¬ 
ity, as a function of the sampling S chosen, can be 
found in (Qu & Richtarik, 2014b). A rather conservative 
choice which works for any S, irrespective of its distribu¬ 
tion, is Vi = min{r, A , (A T A)}||aj|| 2 , where A'(Y) := 
max h {h T Yh : h Tl D{Y)h < 1} and r is a number for 
which |Sj < r with probability 1 (see Theorem 5.1 in the 
aforementioned reference). Better bounds (with smaller v) 
can be derived for special classes of samplings. 


Now we can state the main result of this section: 

Theorem 4 (Complexity of SDNA). Let S be a uniform 
sampling and let r := E[|Sj]. The output sequence 
{w k , a k }fc>o of Algorithm 3 satisfies: 


/-. prox\l c 

E[P( W fc ) - D(a k )} < [ ~ a \ > (D(a*) - D(a 0 )), 

/ prox rminfl.si) » 

where oK := - v ' y ana 

1 n 


si — A n 


— E[(A T Ay+I 


-1 


(24) 


In the case of quadratic losses and quadratic regularizer, we 
can sharpen the complexity bound: 

Theorem 5. When both <pi and g are quadratic functions, 
the output sequence {w k , a k }k>o of Algorithm 3 satisfies: 

E[P( W fe ) - D(a k )} < (1 ~^ l)fc (£>(«*) - D(a 0 )) 

where 

((atA T A + 7I)s)" 1 (^A t A + 7 I)1' 


<7i :— A n 


E 
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4.2. Complexity analysis 

We first establish that SDNA is able to solve the dual. 

Lemma 3. Let S be a uniform sampling and r := E[|«Sj], 
The output sequence {cA }k>o of Algorithm 3 satisfies: 

E[D(a*) - D(a k )} < (1 - a \ rox ) k ( D(a *) - L>(a 0 )), 

where (jP rox is as in Theorem 4. 

Proof If S is uniform, then the output of Algorithm 3 is 
equivalent to the output of Algorithm 1 applied to (20). 
Therefore, the result is obtained by applying Theorem 3 
with M = ^jA A, G = 0 and 7 'i = % for all i. □ 

We now prove a sharper result in the case of quadratic loss 
and quadratic regularizes 

Lemma 4. If and g are quadratic, then the output 
sequence {a k }k> 0 of Algorithm 3 satisfies: 

E[£>(«*) - D(a k )] < (1 - ai) fe (D(a*) - D(a °)), 

where a\ is as in Theorem 5. 

Proof If {(f>i}i and g are all quadratic functions, then the 
dual objective function is quadratic with Hessian matrix 
given by V 2 Z9(a) = jpA T A + ^1. It suffices to apply 
Theorem 1(11), with M = G = \7 2 D(a). □ 

We now prove a more general version of a classical result 
in dual coordinate ascent methods which bounds the duality 
gap from above by the expected dual increase. 

Lemma 5. The output sequence {w k ,a k } k >0 of Algo¬ 
rithm 3 satisfies: 

E k [D{a k+l ) - D(a k )} > 9(S)(P(w k ) - D{a k )). 


a different primal update than SDNA. Hence, in order to 
compare SDNA with an SDCA-like method which is as 
close a match to SDNA as possible, we need to develop 

a new method. Algorithm 4 is an extension of SDCA to 
allow it handle an arbitrary uniform sampling S. 

The complexity of Minibatch SDCA (we henceforth just 
write SDCA) is given in Theorem 6. 

Theorem 6. If (23) holds, then the output sequence 
{w k , a k } k > 0 of Algorithm 4 satisfies: 

E[P{w k ) - D(a k )] < ^ ~ 9 [ S ^ k ( D(a *) - D(a 0 )) . 


Algorithm 4 Minibatch SDCA 

1: Parameters: uniform sampling S, vector v £ M7, 

2: Initialization: a 0 £ R”; set <5° = ^-Aa° 

3: for k = 0,1,2, ... do 
4: Primal update: w k = Vg*(a k ) 

5: Generate a random set of blocks S k ~ S 

6: Compute for each i £ S k 

= argmin hi(ajw k ) + ^|/ii | 2 + </>* (~a k - hi) 

hieR z 

7: Dual update: a k+1 := a k + 'ff ieSk h k et 

8: Average update: a k+1 = a k + -P J2ieS k 

9: end for 


4.4. SDNA vs SDCA 

We now compare the rates of SDNA and SDCA. The next 
result says that the rate of SDNA is always superior to 
that of SDCA. We also see that the rate is better in the 
quadratic case covered by Theorem 5. 

Theorem 7. If S is uniform sampling with r = E[|<Sj], then 

0(S) < a{ rox < (7 r. 


The proof of the this lemma is provided in the supplemen¬ 
tary material. Theorem 4 (resp. Theorem 5) now follows 
by combining Lemma 3 (resp. Lemma 4) and Lemma 5. 

4.3. New Algorithm: SDCA with Arbitrary Sampling 

When \S\ = 1 with probability 1, SDNA reduces to 
a proximal variant of stochastic dual coordinate ascent 
(SDCA) (Shalev-Shwartz & Zhang, 2013d). However, a 
minibatch version of standard SDCA in the ERM setup 
we consider here has not been previously studied in the 
literature. Takac et al. (2013) developed such a method 
but in the special case of hinge-loss (which is not smooth 
and hence does not fit our setup). Shalev-Shwartz & Zhang 
(2013b) studied minibatching but in conjunction with ac¬ 
celeration and the QUARTZ method of Qu et al. (2014), 
which has been analyzed for an arbitrary sampling S, uses 


Proof Since S’ is a uniform sampling, we have p, = ^ for 
all i £ [n\. In view of (22), we have 1 < Next, 


(24)+(23) 
'31 E 


( ^dmdw 


-1 


T 


Therefore, o-P rox = rnin(L .s-i) > 9(S). In order to 
establish cr \ rox < crj, we use Lemma 2 and the fact that 
E[I§] = ^T to obtain 


(^E[(A T A) § ] + 


I) x = ^ 


( E [( 


^-A T A + I 

7 A n 


S J 


(Lemma 2) 

-< E 


(( 


-]-A T A + I 

7 A n 




A E 


((A T A + 7 Anl)^) 1 (A T A + 7 AnI) 


The rest of the argument is similar. 


□ 
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5. SDNA as Iterative Hessian Sketch 

We now apply SDNA to the least squares problem: 

min d ^II aTw ~ fr|| 2 + tylMI 2 ; (25) 

weR d 2n 2 

and show that the resulting primal update can be interpreted 
as an iterative Hessian sketch, alternative to the one pro¬ 
posed by Pilanci & Wainwright (2014). We first need to 
establish a simple duality result. 

Lemma 6. Let a* be the optimal solution of 

» + (26) 

then the optimal solution w* of (25) is w* = ^Aa*. 

Proof Problem (25) is a special case of (18) for g(w) = 
^||tu|| 2 and <j>i{a) = \{a—bi) 2 for alii G [ra]. Problem (26) 
is the dual of (25) and the result follows from (21). □ 

The interpretation of SDNA as a variant of the Iterative 
Hessian sketch method of Pilanci & Wainwright (2014) fol¬ 
lows immediately from the following theorem. 

Theorem 8. The output sequence {w k . a k } k>o of Algo¬ 
rithm 3 applied on problem (25) satisfies: 

w k+1 = argmin{1|(A T w - 6)|| 2 + ±\\w\\ 2 
weR d ^ n ^ 

+ {w^Al Sk a k -\w k )}, (27) 

where Sj, denotes the n-by- \ Sk submatrix of the identity 
matrix I n with columns in the random subset Sk- 

Proof We know that S J Aa k is the optimal solution of 

mi? xlH| 2 + ( S fc (A T ^ fc +« fc - b), h ) + ^-|l AS fe^l| 2 

hes. T 2 2\n 

Let r = \Sk\- By Lemma 6, the optimal solution of 

1 \ t ) 

min— \\SiA T W +Sj(A T W k +a k -b)r+ — ||tu|| 2 , 

weK d 2|6fc| 2|6fc| 

is given by ^-ASfcS^ Va fe , which equals a k+1 — a k and 
thus equals u^ +1 — w k . Hence, 

w k+1 = argmin{ A|j S ^ (A T u;+a fe —&)|| 2 +|||u)—u) fc || 2 }, 

tu6R d 

which is equivalent to (27) since (I„)s fc = S^S^. □ 
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Figure 1. Comparison of SDNA and SDCA for minibatch sizes r = 1, 32, 256 
on a real (left) and synthetic (right) dataset. The methods coincide for r — 1. 
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Figure 2. Time it takes for SDNA and SDCA to process a singe epoch as a func¬ 
tion of the minibatch size r. 

6. Numerical Experiments 

In our first experiment (Figure 1) we compare SDNA 
and our new minibatch version of SDCA on one real 
(mushrooms; d = 112, n = 8,124) and one synthetic 
(d = 1,024, n = 2,048) dataset. In both cases, we 
used A = 1/n as the regularization parameter and g(w) = 
2||tu|| 2 . As r increases, SDNA requires less passes over 
data (epochs), while SDCA requires more passes over data. 
It can be shown that this behavior can be predicted from the 
complexity results for these two methods. The difference 
in performance depends on the choice of the dataset and 
can be quite dramatic. 

In the second experiment (Figure 2), we investigate how 
much time it takes for the methods to process a single 
epoch, using the same datasets as before. As r increases, 
SDNA does more work as the subproblems it needs to solve 
in each iteration involve a r x r submatrix of the Hessian 
of the smooth part of the dual objective function. On the 
other hand, the work SDCA needs to do is much smaller, 
and does nearly does not increase with the minibatch size r. 
This is because the subproblems are separable. As before, 
all experiments are done using a single core (however, both 
methods would benefit from a parallel implementation). 

Finally, in Figure 3 we put the insights gained from the 
previous two experiments together: we look at the perfor- 
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Figure 3. Runtime of SDNA for minibatch sizes r = 1,4, 16, 32, 64 . 


mance of SDNA for various choices of r by comparing 
runtime and duality gap error. We should expect that in¬ 
creasing r would lead to faster method in terms of passes 
over data, but that this would also lead to slower iterations. 
The question is, is does the gain outweight the loss? The 
answer is: yes, for small enough minibatch sizes. Indeed, 
looking at Figure 3, we see that the runtime of SDNA im¬ 
proved up to the point r = 16 for both datasets, and then 
starts to deteriorate. In situations where it is costly to fetch 
data from memory to a (fast) processor, much larger mini¬ 
batch sizes would be optimal. 
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APPENDIX: Proof of Theorem 3 

It follows directly from Assumption 1 and the update rule x k+1 = x k + ( h k )s k in Algorithm 1 that: 


LHS := /(x fe+ 1 )+^V’i(^ +1 )-/(^)- £><(*<) 

i = l i^S k 


< (V/(x fc ), ( h k ) Sk ) + -(h k , X Sk h k ) + M x i + ti). 

ies k 


Since h k is defined as the minimizer of the right hand side in the last inequality, we can further bound this term by replacing 
h k with h = X(x* — x k ) for arbitrary A £ [0,1]: 


LHS < \{(\/f{x k )) Sk ,x* -x k )+Y. ^ - *<)) + Y {x * ~ xk ' Xs *( x * - **)>• ( 28 > 

teSfc 


Now we use the fact that ipi is 7 ,-strongly convex to obtain: 

n n 

F(x k+1 ) — F(x k ) = f(x k+1 )+Y,M *? +1 ) "/(**)-£>(*?) 


( 28 ) 


< M(yf{x k )) Sk ,x* -x k ) + \ [M x *) - M x i)} 

i&S k 

^(1 A ) /™* n/.A /„,* A 


2 <** - x *■> D(7)s fe i x * ~ x *)) + y (x* - x k ,X Sk (x* - x k )). 

By taking expectations in Sk on both sides of the last inequality, we see that for any A £ [0,1], the following holds: 

E k [F{x k+1 )-F{x k )\ < ^ (^((Vf(x k )),x*-x k ) + ^(Ux*)-M x i))^ 

_ A(1_A) ^ ^ E [d(7) .] {x * _ xk)) + - x k ,E[X § ] (x* - x k )) 

< — (F{x*) - F(x k ) - i<®* - x k ,G(x* - x k )) 

n \ 2 

+ Y(** - * fc ,E [X s + D( 7 y (x* - x k )) * fc ,E [D( 7 ) 5 ] {x* - x k )) 

< ^ (F(x*) - F(x k )) - \{x* - x k , ^(D( 7 ) + G)(x* - x k )) 

+ y(x^ai fc ,E[Xs + ^D( 7 )] {x*-x k )), 

where the second to last inequality follows from Assumption 7 and in the last one we used the fact that E[D( 7 ')^] = ^D( 7 ). 
It remains to replace A by min(l, s). 


APPENDIX: Proof of Lemma 5 

Recall that M = 1 X, where X = =!-A T A. 

n 7 An 

For simplicity in this proof we write 9 = 9{S). First, by the 1-strong convexity of the function g we obtain the 1-smoothness 
of the function g*, from which we deduce: 

-A g*{a k+l ) + A g*(a k ) + A (Vg*{d k ), a k+1 - a k ) > -^(ci fe+1 - a k , a k+1 - a k ). 
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Now we replace Wg*(a k ) by w k and a by ^-Aq to obtain: 

D(a k+1 )-D(a k ) > i 

n ies k 


-(A T w k ,a k+1 
n 


- a 


fc )- 

' 2An 2 


{a K+1 - 


= max 
heR" 



h i ) + ct>*(-a k )\ 


i((A T w k )s k ,h)-±h T X Sk h ] 


where in the last equality we used the dual update rules in Algorithm 3, as well as relations (3) and (2). Therefore, for 
arbitrary h £ R n , 


E k [D(a k+1 ) - D(a k )) > E fc 


“E ["#(-“? ->4)+ #(-«<)] 




- E fc 




= [-#(-<*? - /*) + #(-«*) - (aTu,*)^] - i-/r T E[X § )h. 


Let ur £ K" such that u k = V0,(a/ u> fc ) G K. for all i £ [n\. Let s = (si,..., s n ) £ [0, l] n with Sj = Op,- 1 for all 
i £ [n], where 6 is given in (22). By using hi = —Si(a k + u k ) for all i £ [n] in (29), we get: 


Efc[D(a fc+1 ) - D(a k )\ > - E^h^i M 1 “ s *) a i + s i u i) + <t>i{~ a i) + Si{ajw k ,a k + u k )] 


i= 1 


- — (a fe + u k ) 1 D(a) E[X 5 ]D(s)(a fc + u k ) 


From 7 -strong convexity of the functions </>* we deduce that: 


- 4 >*((1 - s«)(-a*) + Si<) + 0- (-a£) > (-<) - Si0*«) + 


fc 7S,(1 — Si) 


Ui + a: 


fc |2 


Consequently, 

?T -i ^ / -| \ 

Efc[-D(a fe+1 ) - D(a k )} > - [#(-<*?) - («?) + (ajw k , a k + u*>] + - E 7 ? ~ K* + Q i P 


2=1 


2 = 1 


- —(a*+tT ) 1 D(s)E[X^]D(s)(a /c +w /c ) 

- E [#(-“*) + Majw k ) + (ajw k , a k )] + + u k , (I - D( S ))(a fe + u k )) 


2=1 


- ^(a k + u k , D(s) E[X^]D( S )(a fc + u k )) 

where the equality follows from u k = V^>i(a/" w k ). Next, by the definition of 0 in (22), we know that: 

Q 

7I E #7D(p _1 ) + 7— D(u o p -1 ) 

Xn 

= 7D(s) + 7 —D(s)D(v op)D(s) 7 D(s) + ^D(s) E[X^]D(s). 


Finally, it follows that 


0 

n 


E ( _a i) + M a 7 wk ) + ( ajw k , a k )] 


0(P{w k ) ~D(a k )). 


E k [D(a k+1 ) - D(a k )\ > 
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APPENDIX: More insight into the relationship between cr 2 and a ?> 

In the main text we have shown that ct 2 > < 73 , where rr-> is the rate of Method 2 and <73 is the rate of Method 3: NSync 
(Richtarik & Takac, 2013b). In this section we give a more detailed description of the relationship between these two 
quantities in the case when S is the r-nice sampling (Richtarik & Takac, 2012). That is, S picks subsets of [n] of cardinality 
r, uniformly at random. For this sampling. 


Pi := Prob(i £ S) = 

n 

Proposition 3. Suppose that G = M and S be the r-nice sampling. Then there exists /3 £ [1, r] such that one can choose 
Vi = and 

_ _/3c>3_ 

0-2 “ ( 1 - + 

' n— 1 ' r n— 1 ~ 0 

Proof. As explained in (Richtarik & Takac, 2012), (10) is always true if we take Vi = with /3 = r but smaller values 

(leading to a faster algorithm) may be computable if the problem exhibits a property called “partial separability”. 

Let us denote by D the diagonal matrix whose entries are the diagonal entries of M. 

(Mr 


I M m = D iti if i £ S (probability T n ) 
“ I 0 otherwise 


(M 


_ I Mij if i £ S and j £ S (probability ^_*j ) 
S.S] ~ i ^ _■ 


[S,S]A.J 


0 otherwise. 


Hence, 

t „ r(r — 1) 

D H--— 

n nyn — 1 ) 
Let us denote A = M - 1 / 2 DM -1 / 2 and a = —k 


E[Mj] = -D + — D) = - ((1 — ^)D + ^m). 

on nnlnrx — 'W n \ 71— 1 7l~ 1 / 


<73 ( = -/3" 1 A min (M 1 / 2 D- 1 M 1 / 2 ) = -/3- 1 (A max (A)) 
n ^ 


a 2 ( = -jA min (M^ 2 (E[Mj]) _ 1 M 1,/2 ) = ^A min (M 1 / 2 -((l - a)D + cM)'^ 1 / 2 ) 


n, 


T 

= -(A max (M- 1 / 2 ((l-a)D + aM)M - 1 / 2 )) -1 = - (A max ((l - a)A + al)) 


n 


But we have A max ((l — a)A + al) = (1 — a)A max (A) + a, so 

— = ( 1 - + a 
ncr 2 npcr 3 

na 2 1 


T ( l ~ a )lfeg+ a 

CT3 


<7 2 = 


/3<7 3 


(i-a+n"fe 


Note that if 0-3 is small, then er 2 is of the order of , ^x -1 > /3ct 3 ■ 

J- _ 1 


□ 















